Add blob direct write with partitioned blob files by xingbowang · Pull Request #14457 · facebook/rocksdb

xingbowang · 2026-03-12T21:32:02Z

Summary

Add a new blob direct write feature with partitioned blob files that writes blob values directly to blob files during Put(), bypassing both WAL and memtable for large values. Only the small (~30 byte) BlobIndex pointer is stored in WAL and memtable. This reduces WAL write amplification, memtable memory usage, and blob write lock contention for large-value workloads.

Motivation

With standard blob separation, full blob values are first written to WAL, then stored in the memtable, and only separated into blob files during flush. For workloads with large values (e.g., 4KB–1MB), this means the WAL and memtable carry the full value payload even though it will eventually be stored separately. This wastes WAL bandwidth, inflates memtable memory, and adds unnecessary write amplification.

Additionally, the existing blob file write path uses a single blob file writer per column family, which becomes a serialization bottleneck under concurrent write workloads. Partitioned blob files address this by spreading writes across multiple independent blob files, each with its own lock, enabling true parallel blob I/O from multiple writer threads.

Design

Write Path

DBImpl::Put() fast path: For single-key puts where the value exceeds min_blob_size, the blob is written directly to a blob file and a BlobIndex-only WriteBatch is constructed, avoiding full value serialization entirely.
DBImpl::WriteImpl() batch path: For multi-key WriteBatch operations, a BlobWriteBatchTransformer iterates the batch, writes qualifying values to blob files, and replaces them with BlobIndex entries before the batch enters WAL/memtable.
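
To illustrate the batch path, here is a minimal, self-contained sketch of the transformation idea. It does not use the real BlobWriteBatchTransformer or WriteBatch types: ToyBlobFile, ToyBlobIndex, ToyEntry, TransformBatch, and ResolveBlob are hypothetical names invented for this example, and the `>=` comparison against min_blob_size is an assumption about the boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a blob file: an append-only byte buffer.
struct ToyBlobFile {
  std::string data;
  // Appends the value and returns the offset at which it was stored.
  uint64_t Append(const std::string& value) {
    uint64_t offset = data.size();
    data.append(value);
    return offset;
  }
};

// Hypothetical stand-in for a BlobIndex: records where the value lives.
struct ToyBlobIndex {
  uint64_t offset = 0;
  uint64_t size = 0;
};

// One batch entry: either an inline value or a pointer into the blob file.
struct ToyEntry {
  std::string key;
  std::string inline_value;  // meaningful when is_blob == false
  ToyBlobIndex blob_index;   // meaningful when is_blob == true
  bool is_blob = false;
};

// Rewrites the batch in place: values of at least min_blob_size bytes are
// appended to the blob file and replaced by a small index entry.
// Returns the number of entries that were redirected.
size_t TransformBatch(std::vector<ToyEntry>& batch, ToyBlobFile& blob_file,
                      size_t min_blob_size) {
  size_t redirected = 0;
  for (ToyEntry& e : batch) {
    if (e.is_blob || e.inline_value.size() < min_blob_size) continue;
    e.blob_index.size = e.inline_value.size();
    e.blob_index.offset = blob_file.Append(e.inline_value);
    e.inline_value.clear();  // only the index would reach WAL/memtable
    e.is_blob = true;
    ++redirected;
  }
  return redirected;
}

// Reads a value back through its index entry.
std::string ResolveBlob(const ToyBlobFile& blob_file, const ToyBlobIndex& idx) {
  assert(idx.offset + idx.size <= blob_file.data.size());
  return blob_file.data.substr(idx.offset, idx.size);
}
```

In this model, small values stay inline and large ones leave only an offset/size pair in the batch, which is the property that keeps the WAL and memtable payload small.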

BlobFilePartitionManager

A new BlobFilePartitionManager manages partitioned blob files for concurrent writes:

Partitioned writes: Multiple blob file partitions (configurable via blob_direct_write_partitions) each with their own mutex, reducing lock contention for concurrent writers.
Deferred flush mode (blob_direct_write_buffer_size > 0): Zero-copy buffering where Slice references point directly into the WriteBatch buffer. Background threads flush to disk in batches, amortizing syscall overhead. Includes backpressure with stall watermarks.
Sync mode (blob_direct_write_buffer_size = 0): Immediate write-through for maximum durability.
Pluggable partition strategy: Custom BlobFilePartitionStrategy interface for key/value-aware partition assignment (default: round-robin).
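
The partitioned-write idea can be sketched as follows. This is a simplified model under stated assumptions, not the BlobFilePartitionManager implementation: ToyPartition and ToyPartitionManager are hypothetical names, each partition is an in-memory buffer rather than a file, and the round-robin counter is one plausible way to realize the default strategy.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical partition: one append-only buffer guarded by its own mutex.
struct ToyPartition {
  std::mutex mu;
  std::string data;
};

class ToyPartitionManager {
 public:
  explicit ToyPartitionManager(size_t num_partitions) {
    assert(num_partitions > 0);
    for (size_t i = 0; i < num_partitions; ++i) {
      partitions_.push_back(std::make_unique<ToyPartition>());
    }
  }

  // Picks a partition round-robin, then appends while holding only that
  // partition's lock. Returns {partition index, offset within partition}.
  std::pair<size_t, uint64_t> Write(const std::string& value) {
    size_t idx = static_cast<size_t>(
        next_.fetch_add(1, std::memory_order_relaxed) % partitions_.size());
    ToyPartition& p = *partitions_[idx];
    std::lock_guard<std::mutex> guard(p.mu);
    uint64_t offset = p.data.size();
    p.data.append(value);
    return {idx, offset};
  }

  // Number of bytes appended to the given partition so far.
  size_t BytesIn(size_t idx) {
    assert(idx < partitions_.size());
    std::lock_guard<std::mutex> guard(partitions_[idx]->mu);
    return partitions_[idx]->data.size();
  }

 private:
  std::vector<std::unique_ptr<ToyPartition>> partitions_;
  std::atomic<uint64_t> next_{0};
};
```

Because writers that land on different partitions never take the same mutex, contention in this model drops roughly with the number of partitions. A key/value-aware strategy would replace the counter with a function of the key or value.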

Flush Integration

On memtable flush, BlobFilePartitionManager::SealAllPartitions() finalizes open blob files and injects BlobFileAddition entries into the flush VersionEdit, so blob files are registered in the MANIFEST atomically with the flush SST.
Handles mempurge: if a flush is switched to mempurge, sealed blob file additions are returned to the partition manager for the next flush.
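
The seal-and-return behavior described above can be modeled roughly as below. This is only a sketch of the bookkeeping: ToyBlobFileAddition and ToySealingManager are hypothetical names, and the real SealAllPartitions() also has to finalize files on disk and build a VersionEdit, which is not modeled here.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a BlobFileAddition record.
struct ToyBlobFileAddition {
  uint64_t file_number;
  uint64_t total_bytes;
};

class ToySealingManager {
 public:
  // Registers a blob file that is currently open for writes.
  void TrackOpenFile(uint64_t file_number, uint64_t total_bytes) {
    open_.push_back({file_number, total_bytes});
  }

  // Seals every open file and returns the additions the flush should record,
  // including any additions handed back by an earlier mempurge.
  std::vector<ToyBlobFileAddition> SealAll() {
    std::vector<ToyBlobFileAddition> out = std::move(returned_);
    returned_.clear();  // leave the moved-from vector in a known state
    out.insert(out.end(), open_.begin(), open_.end());
    open_.clear();
    return out;
  }

  // Called when a flush was switched to mempurge: the additions were not
  // written to the MANIFEST, so keep them for the next flush.
  void ReturnAdditions(const std::vector<ToyBlobFileAddition>& additions) {
    returned_.insert(returned_.end(), additions.begin(), additions.end());
  }

 private:
  std::vector<ToyBlobFileAddition> open_;
  std::vector<ToyBlobFileAddition> returned_;
};
```

The point of the model is that an addition is never dropped: it is either consumed by a flush or carried forward until one succeeds.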

Crash Recovery

Orphan blob file recovery in DBImpl::Open(): Scans for blob files not registered in the MANIFEST (e.g., from crashes before flush), reads their headers to determine the column family, validates records, and registers them via VersionEdit. Runs regardless of the current enable_blob_direct_write setting, so DBs previously opened with the feature enabled are still recovered.
WAL replay produces BlobIndex entries pointing to these recovered blob files, ensuring no data loss.
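
The orphan detection step amounts to a set difference between the blob files present on disk and those the MANIFEST knows about. A minimal sketch, assuming file numbers have already been collected into two sets (FindOrphanBlobFiles is a hypothetical helper, and header parsing and record validation are not shown):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

// Returns the blob file numbers that exist on disk but are not registered in
// the MANIFEST, in ascending order.
std::vector<uint64_t> FindOrphanBlobFiles(
    const std::set<uint64_t>& on_disk,
    const std::set<uint64_t>& in_manifest) {
  std::vector<uint64_t> orphans;
  // std::set iterates in sorted order, which std::set_difference requires.
  std::set_difference(on_disk.begin(), on_disk.end(), in_manifest.begin(),
                      in_manifest.end(), std::back_inserter(orphans));
  return orphans;
}
```

Each returned file number would then go through the header read, validation, and registration steps described above.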

Read Path

DBIter and ArenaWrappedDBIter are extended to resolve BlobIndex entries from direct-write blob files.
Deferred flush mode includes a 4-tier read fallback: pending records → in-flight records → BlobFileCache → blob file read.
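
The fallback order can be sketched as a chain of lookups that stops at the first hit. This is a deliberately simplified model: every tier is represented as a plain map keyed by a hypothetical blob_id, which glosses over what BlobFileCache and the blob file reader actually store. ToyTier, ReadSource, ToyReadResult, and ReadWithFallback are names invented for this example.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical tier: maps a blob identifier to its value.
using ToyTier = std::unordered_map<uint64_t, std::string>;

enum class ReadSource {
  kPending,
  kInFlight,
  kBlobFileCache,
  kBlobFile,
  kNotFound
};

struct ToyReadResult {
  ReadSource source;
  std::string value;
};

// Checks the tiers in the order listed in the description and returns the
// first match together with the tier that served it.
ToyReadResult ReadWithFallback(uint64_t blob_id, const ToyTier& pending,
                               const ToyTier& in_flight, const ToyTier& cache,
                               const ToyTier& file) {
  const std::pair<const ToyTier*, ReadSource> tiers[] = {
      {&pending, ReadSource::kPending},
      {&in_flight, ReadSource::kInFlight},
      {&cache, ReadSource::kBlobFileCache},
      {&file, ReadSource::kBlobFile},
  };
  for (const auto& tier : tiers) {
    auto it = tier.first->find(blob_id);
    if (it != tier.first->end()) {
      return {tier.second, it->second};
    }
  }
  return {ReadSource::kNotFound, std::string()};
}
```

Checking the not-yet-flushed tiers first is what lets a reader observe a blob whose bytes have been buffered but have not reached the file.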

New Options

enable_blob_direct_write (bool, default: false) — master switch
blob_direct_write_partitions (uint32, default: 1) — number of concurrent blob file partitions
blob_direct_write_buffer_size (uint64, default: 4MB) — per-partition write buffer; 0 = sync mode
blob_direct_write_use_direct_io (bool, default: false) — O_DIRECT for blob writes
blob_direct_write_flush_interval_ms (uint64, default: 0) — periodic background flush interval
blob_direct_write_partition_strategy (shared_ptr, default: round-robin)
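
To make the interaction between these options concrete, the sketch below mirrors the listed fields and defaults in a standalone struct and derives the resulting write mode. ToyDirectWriteOptions, ToyWriteMode, and ModeFor are hypothetical and exist only for illustration; where the real fields live in the RocksDB options classes is not shown here, "4MB" is assumed to mean 4 MiB, and the partition strategy field is omitted.

```cpp
#include <cstdint>

// Standalone mirror of the options listed above, with the stated defaults.
struct ToyDirectWriteOptions {
  bool enable_blob_direct_write = false;
  uint32_t blob_direct_write_partitions = 1;
  uint64_t blob_direct_write_buffer_size = 4ull << 20;  // assumed 4 MiB
  bool blob_direct_write_use_direct_io = false;
  uint64_t blob_direct_write_flush_interval_ms = 0;
};

enum class ToyWriteMode { kDisabled, kSync, kDeferred };

// Derives the write mode: the master switch gates the feature, and a zero
// buffer size selects sync (write-through) mode.
ToyWriteMode ModeFor(const ToyDirectWriteOptions& opts) {
  if (!opts.enable_blob_direct_write) return ToyWriteMode::kDisabled;
  return opts.blob_direct_write_buffer_size == 0 ? ToyWriteMode::kSync
                                                 : ToyWriteMode::kDeferred;
}
```

With the defaults the feature is off. Enabling it without touching the buffer size yields deferred flush mode, and setting the buffer size to 0 switches to sync mode.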

New Statistics

BLOB_DB_DIRECT_WRITE_COUNT — number of blobs written via direct write
BLOB_DB_DIRECT_WRITE_BYTES — bytes written via direct write
BLOB_DB_DIRECT_WRITE_STALL_COUNT — writer stalls due to backpressure
BLOB_DB_COMPRESSION_MICROS — blob compression timing

Testing

61 new tests in db_blob_direct_write_test.cc covering: basic put/get, multi-get, concurrent writers, compression (with Snappy availability checks), crash recovery, orphan recovery, WAL recovery, snapshot isolation, transactions (including 2PC), backpressure, multiple column families, file rotation, statistics, event listeners, file checksums, direct I/O, sync/deferred flush modes, and error injection.
db_stress and db_crashtest.py integration for continuous randomized testing.
All existing blob tests updated to coexist with the new code paths.
Full make check passes (39,454 tests, 0 failures).

New Files

db/blob/blob_file_partition_manager.cc/.h — core partition manager (~1,700 lines)
db/blob/blob_write_batch_transformer.cc/.h — WriteBatch transformation logic
db/blob/db_blob_direct_write_test.cc — comprehensive test suite (~2,000 lines)
db/blob/blob_file_completion_callback.cc — SstFileManager and EventListener integration

xingbowang wants to merge 15 commits into facebook:main from xingbowang:2026_03_04_blob_memtable_partition
